export const metadata = {
  title: 'Shipping multi-tenant SaaS using Postgres Row-Level Security',
  sizzle:
    'At Nile, we’re making it easier for companies to build world-class control planes for their infra SaaS products',
  authors: ['miki'],
  tags: ['database', 'postgres', 'rls'],
};

# Shipping multi-tenant SaaS using Postgres Row-Level Security

At Nile, we’re making it easier for companies to build world-class control planes for their infra SaaS products. Multi-tenancy is core to all SaaS products and especially those with control-plane architectures. At Nile, we’ve built multi-tenancy into our product from day one. If you are working on an infra SaaS product and need a multi-tenant control plane, you should <a href="https://thenile.dev?utm_source=blog.nile&utm_medium=blog&utm_campaign=pg_rls_blog&utm_content=top_link" rel=""> talk to us</a>.

From previous experience, we’re familiar with multiple multi-tenant SaaS
architecture options. We decided to store everything in a single Postgres schema since it provides a
balance of scalability, cost optimization, and flexibility. However, this requires serious investment in
database authorization to guarantee that we never leak customer data.

Authorization in a multi-tenant db is something many companies have to deal with, and in previous
companies, I saw authorization implemented in probably the most common way:
appending `WHERE user_id = $USER_ID` to queries. This is also the way things started out at
Nile, but as we added more features we noticed that we were forced to add many branching and
repetitive `WHERE`s to our code. We needed a solution that would allow us to add features quickly
and confidently, and using custom filters in every single query was error-prone and hard to evolve
if our data model changed.

![RLS code excerpt](/img/rls_blog/code_excerpt.png)

One solution that I knew about was
Postgres [Row-Level Security](https://www.postgresql.org/docs/9.5/ddl-rowsecurity.html) (RLS), a
db-level mechanism that filters rows based on a per-user basis. I expected it would allow us to
iterate faster and dramatically reduce security risks. You can learn the basics with
these [two](https://www.bytefish.de/blog/spring_boot_multitenancy_using_rls.html) [blogs](https://callistaenterprise.se/blogg/teknik/2020/10/24/multi-tenancy-with-spring-boot-part6/)
that show how to build multi-tenant apps using Postgres RLS. As with most solutions, the blog
version was easy to implement, but there was an especially long tail to ship to production.

In this blog post, I’ll talk about the alternatives we considered - both for multi-tenant
architecture and for securing data access - why we chose RLS, and the various challenges we
encountered and overcame while shipping it to production.

## Existing multi-tenancy solutions

### Schema-per-tenant and database-per-tenant

We considered both of these approaches but went with the single-schema approach for its minimal
operational complexity, low cost, and ability to scale later on. I won’t go into detail about these
approaches, as there are countless resources on the topic. Here are two resources I’ve found to be
helpful:

1. [Multi-tenant SaaS patterns - Azure SQL Database | Microsoft Docs](https://docs.microsoft.com/en-us/azure/azure-sql/database/saas-tenancy-app-design-patterns?view=azuresql)
2. A great paper from Microsoft -[Multi-Tenant Data Architecture](https://renatoargh.files.wordpress.com/2018/01/article-multi-tenant-data-architecture-2006.pdf)

### Single schema with dynamic WHERE queries

#### Pros

1. Easiest and most straightforward zero-to-one solution.
2. Transparent and easy to reason about.

#### Cons

1. Possibility of forgetting to add a filter to a query. Since queries are permissive by default,
   this
   is easy to miss and hard to detect without extensive testing. There are some solutions to this (
   i.e: [@Filter in Hibernate](https://callistaenterprise.se/blogg/teknik/2020/10/17/multi-tenancy-with-spring-boot-part5/))
   but I find that ORMs make simple querying easier and complicated querying harder. At Nile, our
   authorization model is complicated enough that we didn’t want to rely on Hibernate for this.
2. Repetitive, ugly, and annoying to implement. Imagine you have 20 API endpoints that require
   authorization and 2 different types of roles, `USER` and `ADMIN`. The access controls for these
   two
   roles are different, so you might have to define 40 `WHERE`s across your codebase. This doesn't scale well when adding new roles or modifying existing ones across many API endpoints.

### External authorization systems

#### Pros

1. Highly flexible
2. (Claim to be) scalable

#### Cons

1. $$$ cost, if managed. Operational cost, if self-hosted.
2. Unnecessary if the permissioning model isn’t particularly complicated. At Nile, so far it’s not.
3. External dependencies often make testing more difficult and reduce engineering velocity. The
   benefits have to outweigh these costs.
4. As a control plane, multi-tenancy is core to our product. We believe in building foundational
   capabilities in-house so that we can push the envelope rather than be constrained by external
   solutions.

### What might a better solution look like?

After we chose to use a single multi-tenant schema, we were looking for a solution that would be
cleaner and less error-prone than dynamic queries and lighter than an external authorization
system.

In the rest of this blog post, I’ll lay out what I discovered about RLS in the few weeks I spent
researching and implementing it at Nile, and how it solved our problem (at least for now) of
building authorization with speed, confidence, and maintainable architecture.

## A quick overview of RLS

The high-level process to set up RLS is:

1. Define your data model as usual, but include a tenant identifier in every table
2. Define RLS policies on your tables (i.e: “only return rows for the current tenant”)
3. Define a db user (i.e: `app_user`) with all the privileges your application will need to interact
   with the db, but without any superuser roles. In Postgres, this is necessary
   since [superuser roles bypass all permission checks](https://www.postgresql.org/docs/current/sql-createrole.html)
   , including RLS (more on that later).

### A simple org access control example

Imagine your API has an `/orgs` endpoint that should only return organizations the calling user
is a member of. To achieve this via RLS, you’d define your tables, policies, and db user as such:

```sql title=rls_policy_setup.sql
CREATE
  TABLE
    users(
      id SERIAL PRIMARY KEY
    );

CREATE
  TABLE
    orgs(
      id SERIAL PRIMARY KEY
    );

CREATE
  TABLE
    org_members(
      user INTEGER REFERENCES users NOT NULL,
      org INTEGER REFERENCES orgs NOT NULL
    );

-- ** RLS setup **
ALTER TABLE
  orgs ENABLE ROW LEVEL SECURITY;

-- Create a function, current_app_user(),
-- that returns the user to authorize against.
CREATE
  FUNCTION current_app_user() RETURNS INTEGER AS $$ SELECT
    NULLIF(
      current_setting(
        'app.current_app_user',
        TRUE
      ),
      ''
    )::INTEGER $$ LANGUAGE SQL SECURITY DEFINER;

CREATE
  POLICY org_member_policy ON
  orgs
    USING(
    EXISTS(
      SELECT
        1
      FROM
        org_members
      WHERE
        user = current_app_user()
        AND org = id
    )
  );

-- Create the db user that'll be used in your application.
CREATE
  USER app_user;

GRANT ALL PRIVILEGES ON
ALL TABLES IN SCHEMA public TO app_user;

GRANT ALL PRIVILEGES ON
ALL SEQUENCES IN SCHEMA public TO app_user;
```

The above RLS policy will only return true for organizations that the current user is a member of.
Simple enough. Later on, we’ll see how things can get more complicated.

Note the `current_app_user()` function. In the traditional use case of direct db access, RLS works
by defining policies on tables that filter rows based on the current db user. For a SaaS
application, however, defining a new db user for each app user is clunky. For an application use
case you can dynamically set and retrieve users using
Postgres’ [current_settings()](https://pgpedia.info/c/current_setting.html) function (
i.e: `SET app.current_app_user = ‘usr_123’` and `SELECT current_settings(‘app.current_app_user)`).

### What it looks like from request to response

![Request to response diagram](/img/rls_blog/request_to_response.svg)

## Why we chose RLS

### It fails by default - and therefore secure by default

The biggest benefit of RLS is that if you define a policy that’s too restrictive, or forget to
define a policy, things just fail. Compared to dynamic queries where forgetting to add a `WHERE`
will leak data, this is a big win for security. I didn’t appreciate this until I wrote some
integration tests for access patterns (i.e: testing if a user can access orgs they’re a part of).
Initially, all the tests failed, and for cases where users should have access tests only passed when
I added the appropriate RLS policies.

RLS is, of course, not a silver bullet. Accidentally defining an overly permissive policy is hard to
catch without extensive tests so it’s important to still be careful.

### Defined once, applied everywhere

One of the main challenges with dynamic queries in single-schema multi-tenancy is that changes to
tables often require touching many different queries. RLS solves this problem since **policies are
tied to tables and not queries**. After modifying a table, all you need to do is to change its
access
policies, which will be applied to all queries.

### Composability

With RLS, it’s easy to add more access rules as your multi-tenant data model evolves. According to
the
[Postgres docs](https://www.postgresql.org/docs/current/ddl-rowsecurity.html#:~:text=When%20multiple%20policies%20apply%20to,they%20are%20a%20member%20of.):

> _“When multiple policies apply to a given query, they are combined using either OR (for permissive
> policies, which are the default) or using AND (for restrictive policies).”_

Since by default policies are combined with OR, this makes it super easy to define more policies as
your access rules get more complex. This isn’t so straightforward with dynamic queries, where you
might have to define your own logic for combining access rules. Or, as probably many of us have seen
before, just create monster `WHERE` statements.

### Separation of Concerns

Instead of mixing filters that are related to our application logic with filters that are related to
the multi-tenant database design in the same WHERE clauses, we now have a clean separation:

- Our application applies all the filters that are requested by users through APIs and other
  application logic.
- RLS is responsible for filters that are required due to the multi-tenant database design.

## Cases where RLS isn’t a great fit

Every technology has its tradeoffs and cases where you shouldn’t use it. Here are two cases where we
think RLS isn’t a great fit:

#### If you need stronger isolation between tenants

RLS in a multi-tenant db isolates access to database rows, but all other database resources are
still shared between tenants. It doesn’t help with limiting the disk space, CPU, or db cache used
per tenant. If you need stronger isolation at the db level, you will need to look elsewhere.

#### If you have sophisticated access policies

As you will see in the next section, our current access policy is fairly simple - tenants are
isolated from each other, and within a tenant, you have administrators with additional access. More
mature access control policies such as RBAC/ABAC require their own schema design and can be more
challenging to integrate with RLS and even more challenging to make performant.

We’ve recently started the design for the RBAC/ABAC feature in Nile (talk to us if you are
interested in joining the conversation), and we will have a follow-up blog with recommendations on
best practices for adding RBAC/ABAC to multi-tenant SaaS.

## Implementation challenges

### A few gotchas

**One gotcha we encountered was that RLS doesn’t apply to superusers and table owners.** According
to
the [Postgres docs](https://www.postgresql.org/docs/current/ddl-rowsecurity.html#:~:text=Superusers%20and%20roles%20with%20the,security%20with%20ALTER%20TABLE%20...):

> _“Superusers and roles with the BYPASSRLS attribute always bypass the row security system when
> accessing a table. Table owners normally bypass row security as well, though a table owner can
> choose to be subject to row security with ALTER TABLE ... FORCE ROW LEVEL SECURITY.”_

Both of the blogs I shared earlier create a user called `app_user` that’s used in the application.
We did this as well, locally, but didn’t change the database user when deploying to our testing
environment. Thankfully, we caught and fixed this issue quickly.

**Another issue we caught during testing was that some requests were being authorized with a
previous
request’s user id.** We discovered that since the user id for RLS was being stored in
thread-local storage and threads were being reused for requests, it was necessary to set up a
post-response handler to reset thread-local storage.

Overall, so far the gotchas haven’t been too tricky to diagnose and resolve, but as one might expect
with anything security-related, they do have serious consequences if not addressed.

### Initial widespread code changes

Although RLS addresses the problem of continuous widespread changes well (see [“Defined once, applied
everywhere”](https://thenile.dev/blog#defined-once-applied-everywhere)), initially switching from dynamic queries to RLS requires more code changes than you
might think. Here’s an example of how RLS might affect an API endpoint to update an organization
that’s only callable by users in that org:

```typescript title="before_and_after_rls.java"
/*
** ---- Without RLS ---- **

1. Check if user is a member of the org
  a. If so, execute the update query
  b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
  if (dao.isOrgMember(userId, orgId)) {
    return dao.updateOrg(updatePayload);
  } else {
    throw new NotFoundException();
  }
}

/* -- DAO layer -- */

boolean isOrgMember(userId, orgId) {
  return query("EXISTS(SELECT 1 ...)");
}

Org updateOrg(updatePayload) {
  return query("UPDATE orgs SET ... RETURNING *");
}

/*
** ----- With RLS ---- **

1. Execute the update query
  a. If the org was returned from the db, return the org in the response
  b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
    Optional<Org> maybeOrg = dao.updateOrg(updatePayload);
    if (maybeOrg.isPresent()) {
      return maybeOrg.get();
    } else {
       throw new NotFoundException();
    }
}

/* -- DAO layer -- */

Optional<Org> updateOrg(updatePayload) {
  return query("UPDATE orgs SET ... RETURNING *");
}
```

In this example, authorization without RLS is done before writing to the db. With RLS, since
authorization is determined at query time, write queries might fail so error handling has to be
pushed down to the db level. This isn’t a mind-boggling change but is one you should keep in mind
when planning to add RLS in any project that involves a multi-tenant db.

## The gaps between blog-ready and production-ready RLS

### Recursive permission policies

Let’s say you want to add an admin user type and implement the following access rules:

1. Users can read, update, and delete their own user profiles.
2. Users can read the profiles of other users who belong to the same tenant.
3. Users with admin access can read, update, and delete other users who belong to the same tenant.

The first two use cases are possible with straightforward RLS policies, but the third isn’t. This is
because we must query the `users` table to see if the user in question is an admin (
i.e: `SELECT 1 FROM users WHERE id = current_app_user() AND is_admin = TRUE`). Since querying a
table triggers its RLS policy checks, executing this query within a `users` RLS policy will
trigger `users` RLS policy checks, which will call this query, which will trigger RLS policy checks,
resulting in an infinite loop. Postgres will catch this error instead of timing out, but you should
make sure to test your policies so this doesn’t happen at runtime. You can avoid this problem by
defining a function with `SECURITY DEFINER` permissions that’s to be used in the RLS
policy. According to
the [Postgres docs](https://www.postgresql.org/docs/current/sql-createfunction.html#:~:text=SECURITY%20DEFINER%20specifies%20that%20the,functions%20not%20only%20external%20ones.):

> _"`SECURITY DEFINER` specifies that the function is to be executed with the privileges of the user
> that owns it."_

In our case, this user is the superuser that you probably used to set up your database. So they
bypass RLS.

:::note
By using `SECURITY DEFINER` you are allowing users to bypass the security policy and use
superuser
privileges regardless of who they really are, so you must be careful. I recommend reviewing the
“[Writing SECURITY DEFINER Functions Safely](https://www.postgresql.org/docs/current/sql-createfunction.html#:~:text=Writing%20SECURITY%20DEFINER%20Functions%20Safely&text=For%20security%2C%20search_path%20should%20be,be%20used%20by%20the%20function.)
” section of the Postgres documentation before using this capability.
:::

Here’s an example of how to implement RLS policies that satisfy the three use cases above:

```sql title="complex_rls_policy.sql"
CREATE
  TABLE
    users(
      id SERIAL PRIMARY KEY,
      is_admin BOOLEAN
    );

ALTER TABLE
  users ENABLE ROW LEVEL SECURITY;

-- Users can do anything to themselves.
CREATE
  POLICY self_policy ON
  users
    USING(
    id = current_app_user()
  );

CREATE
  FUNCTION is_user_admin(
    _user_id INTEGER
  ) RETURNS bool AS $$ SELECT
    EXISTS(
      SELECT
        1
      FROM
        users
      WHERE
        id = _user_id
        AND is_admin = TRUE
    ) $$ LANGUAGE SQL SECURITY DEFINER;

CREATE
  FUNCTION do_users_share_org(
    _user_id_1 INTEGER,
    _user_id_2 INTEGER
  ) RETURNS bool AS $$ SELECT
    EXISTS(
      SELECT
        1
      FROM
        org_members om1,
        org_members om2
      WHERE
        om1.user != om2.user
        AND om1.org = om2.org
        AND om1.user = _user_id_1
        AND om2.user = _user_id_2
    ) $$ LANGUAGE SQL SECURITY INVOKER;

-- Non-admins can only read users in their orgs.
CREATE
  POLICY read_in_shared_orgs_policy ON
  users FOR SELECT
      USING(
      do_users_share_org(
        current_app_user(),
        id
      )
    );

CREATE
  POLICY admin_policy ON
  users
    USING(
    do_users_share_org(
      current_app_user(),
      id
    )
    AND is_user_admin(
      current_app_user()
    )
  );
```

Note the use of the `do_users_share_org()` `SECURITY INVOKER` function. According to
the [Postgres docs](https://www.postgresql.org/docs/current/sql-createfunction.html#:~:text=SECURITY%20INVOKER%20indicates%20that%20the,the%20user%20that%20owns%20it.):

> _“SECURITY INVOKER indicates that the function is to be executed with the privileges of the user
> that
> calls it.”_

In our case, this is `app_user` (who doesn’t bypass RLS), so we just define these functions for
reusability purposes.

### Logging

It’s important to set up logging before shipping any feature to production. This is especially true
with RLS where logging the execution of the actual
policies [isn’t directly possible](https://dba.stackexchange.com/questions/299963/log-occurrences-when-row-level-security-policies-filter-out-rows)
. For each request, it’s helpful to log the user and tenant IDs to be used for RLS when:

- Parsing them from auth headers
- Setting and getting them from thread-local storage
- Setting them in the db connection
- This makes it easier to identify bugs related to thread-local storage
  When resetting them in thread-local storage

It’s also a good idea to enable more detailed logging in the db, at least initially, to see the
values actually being inserted/retrieved. If policies return too few/many results, or inserts fail
unexpectedly, it’s easier to figure out what went wrong.

## Testing

In multi-tenant SaaS, guaranteeing the security of each tenant is critical. We have an extensive
suite of integration tests that test every access pattern to make sure that nothing ever leaks. The
tests spin up a Postgres [Testcontainer](https://www.testcontainers.org/modules/databases/postgres/)
and call the relevant API endpoints, checking that proper access is always enforced.

In order to minimize the execution time of a large suite of integration tests, we avoid setup and
teardown of the database between tests and annotate the order in which tests run to make sure the
results are deterministic even without a full cleanup in between tests. As we scale, we’ll look into
other options like property-based testing and parallelizing our tests.

The switch from dynamic queries to RLS has been seamless in our integration tests. All we had to do
was to make sure our tests were using the newly-created `app_user` that doesn’t bypass RLS.

## Conclusion

Every modern SaaS product is multi-tenant, but the good ones are also scalable, cost-effective, and
maintainable. Scalability and cost-effectiveness are the results of careful system design.
Maintainability includes design considerations such as
the [DRY principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) (don’t repeat yourself)
and a separation of concerns, which make mistakes less likely and testing and troubleshooting
easier.

As we’ve shown, a single-schema multi-tenant database with RLS ticks all the checkboxes for
scalable, cost-effective, and maintainable architecture. This blog includes everything you need to
get started with your own multi-tenant SaaS architecture. But if this seems like too much and you’d
rather have someone else handle this for
you - <a href="https://thenile.dev?utm_source=blog.nile&utm_medium=blog&utm_campaign=pg_rls_blog&utm_content=bottom_link" rel="">
talk to us</a> :)
